BITHub has two aims:
1. Allow the expression of a gene (or genes) of interest to be compared across multiple datasets.
2. Allow expression to be compared against metadata variables in a given dataset.
For the second aim, it is vital that BITHub contains the relevant metadata annotations. BITHub aims to provide 3 types of metadata annotations for each dataset in the web-browser:
- Phenotype annotations
- Sequencing metrics
- Sample characteristics: these annotations relate to how the samples were experimentally prepared
To keep the metadata display user-friendly, highly correlated metadata annotations will be removed and only a subset will be used for the site. We will also perform variancePartition analysis on the subsetted list.
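source("functions.R")
library(pander)
# Attached explicitly here in case functions.R does not already load them
library(magrittr)  # provides %<>%
library(dplyr)
library(corrplot)
bseq = read.csv("/home/neuro/Documents/BrainData/Bulk/Brainseq/Formatted/BrainSeq-metadata.csv", header = TRUE, row.names = 1)
# Drop constant columns, which would otherwise give NA correlations
bseq %<>%
  dplyr::select(-c(FQCbasicStats, perSeqQual, SeqLengthDist, KmerContent))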
M = cor(data.matrix(bseq), use = "complete.obs")
corrplot(M, order = 'AOE')
Correlation plot of metadata annotations from BrainSeq phase II. The metadata annotations are clustered based on correlation
Before running the cor() function, the FQCbasicStats, perSeqQual, SeqLengthDist and KmerContent columns were removed. Each of these columns contained a single repeated value, so its standard deviation is zero and its correlations come out as NA.
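The constant-column check described above can also be done programmatically rather than by naming the columns by hand. The following is a minimal sketch on toy data; the data frame `md`, its column `flag` and the helper variable names are made up for illustration and are not part of the BrainSeq metadata.

```r
# Toy metadata: `flag` holds the same value in every row (hypothetical column).
md <- data.frame(
  RIN      = c(7.1, 8.3, 6.9, 9.0),
  mitoRate = c(0.02, 0.05, 0.03, 0.01),
  flag     = c("PASS", "PASS", "PASS", "PASS")
)

# data.matrix() turns the character column into integer codes (all identical here).
m <- data.matrix(md)

# A column is constant when it has at most one distinct non-missing value.
is_constant <- apply(m, 2, function(x) length(unique(x[!is.na(x)])) <= 1)
constant_cols <- names(is_constant)[is_constant]

# Correlate only the informative columns, so no NA entries appear.
M <- cor(m[, !is_constant, drop = FALSE], use = "complete.obs")
```

Here `constant_cols` lists the columns that would need to be dropped before calling cor().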
The BrainSeq metadata annotations show duplicate information in many columns (e.g. SampleID, SAMPLEID), likely a result of running the pre-processing pipeline for BITHub. In addition, certain columns contain very similar information and are therefore highly correlated. Several RNA-seq QC metrics also provide redundant information and will be removed for downstream analysis.
The final BrainSeq annotations will contain the following columns:
bseq.annot = read.csv("../annotations/BrainSeq-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>%
  dplyr::select(-c(Include..Yes.No....Interest))
bseq.annot %>%
  pander(justify = "lll",
         style = "rmarkdown",
         caption = "BrainSeq metadata annotations that will be used for BITHub ")
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| X | SampleID | Sample characteristics |
| trimmed | trimmed | Sequencing metrics |
| numReads | TotalNReads | Sequencing metrics |
| numMapped | numMapped | Sequencing metrics |
| numUnmapped | numUnmapped | Sequencing metrics |
| overallMapRate | MappingRate | Sequencing metrics |
| concordMapRate | concordMapRate | Sequencing metrics |
| totalMapped | totalMapped | Sequencing metrics |
| mitoMapped | mitoMapped | Sequencing metrics |
| mitoRate | mito_Rate | Sequencing metrics |
| totalAssignedGene | totalAssignedGene | Sequencing metrics |
| rRNA_rate | rRNA_rate | Sequencing metrics |
| RNum | SampleID | Phenotype |
| Region | StructureAcronym | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| Age | AgeNumeric | Phenotype |
| Sex | Sex | Phenotype |
| Race | Ethnicity | Phenotype |
| Dx | Diagnosis | Phenotype |
| Fetal_replicating | Dev.Replicating | Sample characteristics |
| Fetal_quiescent | Dev.Quiescent | Sample characteristics |
| OPC | Adult.OPC | Sample characteristics |
| Neurons | Adult.Neurons | Sample characteristics |
| Astrocytes | Adult.Astrocytes | Sample characteristics |
| Oligodendrocytes | Adult.Oligo | Sample characteristics |
| Microglia | Adult.Microglia | Sample characteristics |
| Endothelial | Adult.Endothelial | Sample characteristics |
| NA | AgeInterval | Phenotype |
| NA | Period | Phenotype |
| NA | Regions | Sample characteristics |
bseq %<>% dplyr::select(contains(bseq.annot$BITColumnName))
write.csv(bseq, file = "/home/neuro/Documents/BrainData/Bulk/Brainseq/Formatted/BrainSeq-metadata-subset.csv")
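One point worth checking in the subsetting step is that contains() matches column names by substring, so a requested name such as RIN would also pull in a column such as mRIN if both are present. The sketch below contrasts it with any_of(), which matches whole names only. The toy data frame `md` and the vector `wanted` are illustrative assumptions, not the BITHub annotation files.

```r
library(dplyr)

# Toy metadata with one column name (RIN) nested inside another (mRIN).
md <- data.frame(
  RIN  = c(7.1, 8.3),
  mRIN = c(0.4, -0.2),
  Sex  = c("F", "M")
)

# Hypothetical list of column names to keep.
wanted <- c("RIN", "Sex")

# contains() keeps every column whose name includes one of the strings.
by_substring <- names(select(md, contains(wanted)))

# any_of() keeps only exact name matches and ignores names that are absent.
by_name <- names(select(md, any_of(wanted)))
```

With this toy input, `by_substring` also includes mRIN, whereas `by_name` holds only the two requested columns. Whether this matters for a given dataset depends on its actual column names.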
bspan = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata.csv", header = TRUE, row.names = 1)
M = bspan %>% dplyr::select(-c("Diagnosis")) %>% data.matrix() %>% cor(.,use = "complete.obs")
corrplot(M, order='AOE')
BrainSpan metadata annotations contain several duplicate and redundant columns that carry essentially the same information (e.g. column_num, Age.x, Braincode). The BrainSpan annotations were retrieved from multiple sources, so these duplicates are likely a result of the different IDs they were stored under.
The following BrainSpan metadata annotations will be used for BITHub:
bspan.annot = read.csv("../annotations/BrainSpan-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>%
  dplyr::select(-c(Include..Yes.No....Interest))
bspan.annot %>%
  pander(justify = "lll",
         style = "rmarkdown",
         caption = "BrainSpan metadata annotations that will be used for BITHub ")
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SampleID | SampleID | Sample characteristics |
| gender | Sex | Phenotype |
| structure_acronym | StructureAcronym | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeNumeric | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Diagnosis | Phenotype |
| NA | Regions | Sample characteristics |
| NA | mRIN | Sequencing metrics |
| Hemisphere | Hemisphere | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| PMI | PMI | Sequencing metrics |
| pH | pH | Sequencing metrics |
| Ethnicity | Ethnicity | Phenotype |
bspan %<>% dplyr::select(contains(bspan.annot$BITColumnName))
write.csv(bspan, file = "/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata-subset.csv")
gtex = read.csv("/home/neuro/Documents/BrainData/Bulk/GTEx/Formatted/GTEx-metadata.csv", header = TRUE, row.names = 1)
M = cor(data.matrix(gtex))
## Warning in cor(data.matrix(gtex)): the standard deviation is zero
corrplot(M, method = 'number')
The GTEx metadata contains comprehensive annotations of sample, sequencing and phenotype attributes. However, for BITHub, we will remove metadata annotations that have a strong positive correlation.
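As an illustration of how strongly correlated annotations could be flagged before choosing which ones to drop, here is a minimal base R sketch on simulated data. The 0.9 cutoff, the simulated values and the variable names are assumptions made for this example; the columns actually retained for BITHub were chosen via the annotation file.

```r
set.seed(1)
n <- 50
reads <- rnorm(n, mean = 5e7, sd = 5e6)

# Simulated metadata: ReadsMapped is almost a rescaled copy of TotalNReads.
md <- data.frame(
  TotalNReads = reads,
  ReadsMapped = reads * 0.9 + rnorm(n, sd = 1e5),
  RIN         = runif(n, 5, 10)
)

M <- cor(md, use = "pairwise.complete.obs")

cutoff <- 0.9  # hypothetical threshold for "strong positive correlation"

# Restrict to the upper triangle so each pair is reported once.
idx <- which(M > cutoff & upper.tri(M), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  r    = M[idx]
)
```

Each row of `high_pairs` names two annotations that carry nearly the same information, so one of the two is a candidate for removal.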
The following metadata annotations will be used for GTEx:
gtex.annot = read.csv("../annotations/GTEx-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>%
  dplyr::select(-c(Include..Yes.No....Interest))
gtex %<>% dplyr::select(contains(gtex.annot$BITColumnName))
gtex.annot %>%
  pander(justify = "lll",
         style = "rmarkdown",
         caption = "GTEx metadata annotations that will be used for BITHub ")
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SAMPID | SampleID | Sample characteristics |
| SMRIN | RIN | Sequencing metrics |
| SMTSISCH | PMI | Sequencing metrics |
| AGE | AgeInterval | Phenotype |
| SEX | Sex | Phenotype |
| SMATSSCR | AutolysisScore | Sample characteristics |
| SMNABTCH | IsolationBatchID | Sample characteristics |
| SMNABTCHT | TypeofBatch | Sample characteristics |
| SMNABTCHD | DateofBatch | Sample characteristics |
| SMGEBTCH | Genotype_or_Expression_Batch_ID | Sample characteristics |
| SMGEBTCHD | DateofGenotypeorExpressionBatch | Sample characteristics |
| SMGEBTCHT | TypeofGenotypeorExpressionBatch | Sample characteristics |
| SMCENTER | BSS_Collection_side_code | Sample characteristics |
| SMTSPAX | Time_spent_in_PAXgene_fixative | Sequencing metrics |
| SME2MPRT | End_2_mapping_rate | Sequencing metrics |
| SMCHMPRS | ChimericPairs | Sequencing metrics |
| SMNTRART | IntragenicRate | Sequencing metrics |
| SMNUMGPS | No_of_Gaps | Sequencing metrics |
| SMMAPRT | MappingRate_total | Sequencing metrics |
| SMEXNCRT | ExonicRate | Sequencing metrics |
| SM550NRM | BasedNormalised | Sequencing metrics |
| SMGNSDTC | GenesDetected | Sequencing metrics |
| SMUNMPRT | Rate_of_mapped_genes_unique | Sequencing metrics |
| SM350NRM | BaseNormilization | Sequencing metrics |
| SMESTLBS | LibrarySize | Sequencing metrics |
| SMMPPD | ReadsMapped | Sequencing metrics |
| SMNTERRT | IntergenicRate | Sequencing metrics |
| SMRRNANM | rRNA | Sequencing metrics |
| SMRDTTL | TotalNReads | Sequencing metrics |
| SMMNCV | Mean_Coeff_Variation | Sequencing metrics |
| SMTRSCPT | TranscriptsDetected | Sequencing metrics |
| SMMPPDPR | MappedPairs | Sequencing metrics |
| SMUNPDRD | UnpairedReads | Sequencing metrics |
| SMNTRNRT | IntronicRate | Sequencing metrics |
| SMMPUNRT | Mapped_unique_rate_of_total | Sequencing metrics |
| SMEXPEFF | ExpressionProfilingEfficiency | Sequencing metrics |
| SMMPPDUN | MappedUnique_no_dup_flags | Sequencing metrics |
| SME2MMRT | End_2_Mismatch_Rate | Sequencing metrics |
| SME2ANTI | End_2_Antisense | Sequencing metrics |
| SME2SNSE | End_Sense_2 | Sequencing metrics |
| SME1ANTI | End_1_Antisense | Sequencing metrics |
| SME1SNSE | End_1_Sense | Sequencing metrics |
| SME1PCTS | End_1_Sense_percentage | Sequencing metrics |
| SMRRNART | rRNA_rate | Sequencing metrics |
| SME1MPRT | End_1_Mapping_rate | Sequencing metrics |
| SMNUM5CD | Num_of_Reads_Covered_5prime | Sequencing metrics |
| SMDPMPRT | DuplicationRateMapped | Sequencing metrics |
| SME2PCTS | Percentage_IntragenicEnd_2_Reads | Sequencing metrics |
| DTHHRDY | HardyScale | Phenotype |
# Write the GTEx subset (the original call passed the bspan object by mistake)
write.csv(gtex, file = "/home/neuro/Documents/BrainData/Bulk/GTEx/Formatted/GTEx-metadata-subset.csv")
pe = read.csv("/home/neuro/Documents/BrainData/Bulk/PsychEncode/Formatted/PsychEncode-metadata.csv", header = TRUE, row.names = 1)
M = pe %>%
  dplyr::select(-c(ageBiopsy, smellTestScore, smoker, Structure, StructureAcronym, Regions, Capstone_4, Adult.In7)) %>%
  data.matrix() %>%
  cor(., use = 'pairwise.complete.obs')
## Warning in cor(., use = "pairwise.complete.obs"): the standard deviation is
## zero
corrplot(M)
Information regarding Row_IDs, Row_Versions, Contributing Studies and Notes will be removed for BITHub.
The following metadata annotations will be retained for PsychEncode:
pe.annot = read.csv("../annotations/PsychEncode-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>%
  dplyr::select(-c(Include..Yes.No....Interest))
pe %<>% dplyr::select(contains(pe.annot$BITColumnName))
pe.annot %>%
  pander(justify = "llr",
         style = "rmarkdown",
         caption = "PsychEncode metadata annotations that will be used for BITHub ")
| OriginalColumnName | BITColumnName | Type |
|---|---|---|
| individualID | SampleID | Sample characteristics |
| diagnosis | Diagnosis | Phenotype |
| sex | Sex | Phenotype |
| ethnicity | Ethnicity | Phenotype |
| ageDeath | AgeNumeric | Phenotype |
| Adult.Ex1 | Adult.Ex1 | Sample characteristics |
| Adult.Ex2 | Adult.Ex2 | Sample characteristics |
| Adult.Ex3 | Adult.Ex3 | Sample characteristics |
| Adult.Ex4 | Adult.Ex4 | Sample characteristics |
| Adult.Ex5 | Adult.Ex5 | Sample characteristics |
| Adult.Ex6 | Adult.Ex6 | Sample characteristics |
| Adult.Ex7 | Adult.Ex7 | Sample characteristics |
| Adult.Ex8 | Adult.Ex8 | Sample characteristics |
| Adult.In1 | Adult.In1 | Sample characteristics |
| Adult.In2 | Adult.In2 | Sample characteristics |
| Adult.In3 | Adult.In3 | Sample characteristics |
| Adult.In4 | Adult.In4 | Sample characteristics |
| Adult.In5 | Adult.In5 | Sample characteristics |
| Adult.In6 | Adult.In6 | Sample characteristics |
| Adult.In7 | Adult.In7 | Sample characteristics |
| Adult.In8 | Adult.In8 | Sample characteristics |
| Adult.Astrocytes | Adult.Astrocytes | Sample characteristics |
| Adult.Endothelial | Adult.Endothelial | Sample characteristics |
| Dev.Quiescent | Dev.Quiescent | Sample characteristics |
| Dev.Replicating | Dev.Replicating | Sample characteristics |
| Adult.Microglia | Adult.Microglia | Sample characteristics |
| Adult.OtherNeuron | Adult.OtherNeuron | Sample characteristics |
| Adult.OPC | Adult.OPC | Sample characteristics |
| Adult.Oligo | Adult.Oligo | Sample characteristics |
| structure_acronym | StructureAcronym | Sample characteristics |
| ageOnset | ageOnset | Phenotype |
| causeDeath | causeDeath | Phenotype |
| brainWeight | brainWeight | Phenotype |
| height | height | Phenotype |
| weight | weight | Phenotype |
| ageBiopsy | ageBiopsy | Sample characteristics |
| smellTestScore | smellTestScore | Sample characteristics |
| smoker | smoker | Sample characteristics |
| Capstone_4 | Capstone_4 | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Regions | Sample characteristics |
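# Write the PsychEncode subset. The original call wrote bspan to the GTEx path;
# the output path below follows the naming pattern of the other datasets.
write.csv(pe, file = "/home/neuro/Documents/BrainData/Bulk/PsychEncode/Formatted/PsychEncode-metadata-subset.csv")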
#bseq <- read.csv("datasets/FormattedData/FormattedData/BrainSeq/BrainSeq-exp.csv", row.names =2)[,-1]
#head(bseq)[1:10]
#bseq.exp <- bseq[apply(bseq >= 1, 1, sum) >= 0.1*ncol(bseq),]
#bseq.md <- read.csv("datasets/FormattedData/FormattedData/BrainSeq/BrainSeq-metadata-subset.csv", row.names = 1)
#head(bseq.md)
#form.bseq <- ~ AgeNumeric + (1|StructureAcronym) + (1|Sex) + RIN + (1|Diagnosis) + mito_Rate + rRNA_rate + TotalNReads + MappingRate
#+ Adult.Oligo + Adult.Microglia + Adult.Endothelial
#varPar.bseq <- fitExtractVarPartModel(bseq.exp, form.bseq, bseq.md)
#bspan.exp = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-exp.csv", row.names = 1, check.names = FALSE) #%>%
# column_to_rownames("EnsemblID")
#bspan.meta = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata.csv")
#bspan.exp <- bspan.exp[apply(bspan.exp >= 1, 1, sum) >= 0.1*ncol(bspan.exp),]
#form.bspan <- ~ AgeNumeric + (1|StructureAcronym) + (1|Sex) + (1|Period) + (1|Regions)
#varPar.bspan <- fitExtractVarPartModel(bspan.exp, form.bspan, bspan.meta)
#gtex.exp <- read.csv("datasets/FormattedData/FormattedData/Gtex/GTEx-exp.csv", row.names = 1)
#gtex.md <- read.csv("datasets/FormattedData/FormattedData/Gtex/GTEx-metadata-subset.csv")
#colnames(gtex.md)
#gtex.exp <- gtex.exp[apply(gtex.exp >= 1, 1, sum) >= 0.1*ncol(gtex.exp),]
#gtex.form <- ~ TotalNReads + rRNA_rate + (1|TypeofBatch) + (1|DateofBatch) + (1|BSS_Collection_side_code) + (1|AgeInterval) + (1|Sex) + #(1|Regions) + IntergenicRate + RIN
#varPar.gtex <- fitExtractVarPartModel(gtex.exp, gtex.form, gtex.md)
#vp <- sortCols(varPar.gtex)
#plotVarPart(vp)
#write.csv(vp, "GTEx-varPart.csv")
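The commented-out variancePartition code above filters each expression matrix before model fitting, keeping genes with a value of at least 1 in at least 10% of samples. A minimal sketch of that filter on a toy matrix follows; the gene names and values are made up, and rowSums() is used here as an equivalent of the apply(..., 1, sum) call.

```r
# Toy expression matrix: 3 genes (rows) x 10 samples (columns).
expr <- rbind(
  geneA = c(5, 3, 8, 2, 9, 4, 6, 7, 1, 2),  # expressed in every sample
  geneB = c(2, 4, 0, 0, 0, 0, 0, 0, 0, 0),  # expressed in 2 of 10 samples
  geneC = rep(0, 10)                        # never expressed
)

# Minimum number of samples in which a gene must reach a value of 1.
min_samples <- 0.1 * ncol(expr)

# Count, per gene, the samples with expression >= 1 and apply the threshold.
keep <- rowSums(expr >= 1) >= min_samples
expr_filtered <- expr[keep, , drop = FALSE]
```

In this toy case geneA and geneB pass the filter and geneC is dropped.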